Abstract

In data fusion, analysts seek to combine information from two databases comprising disjoint
sets of individuals, in which some variables appear in both databases and other variables appear
in only one database. Most data fusion techniques rely on variants of conditional independence
assumptions. When inappropriate, these assumptions can result in unreliable inferences. We
propose a data fusion technique that allows analysts to easily incorporate auxiliary information
on the dependence structure of variables not observed jointly; we refer to this auxiliary
information as glue. With this technique, we fuse two marketing surveys from the book publisher
HarperCollins using glue from the online, rapid-response polling company CivicScience. The 
fused data enable estimation of associations between people’s preferences for authors and for 
learning about new books. The analysis also serves as a case study on the potential for using 
online surveys to aid data fusion. 
1 Introduction

In many applications in marketing, analysts seek to combine information from two or more databases
containing information on disjoint sets of individuals and distinct sets of variables (Kamakura and
Wedel 1997; van der Putten et al. 2002; Kamakura et al. 2003; Gilula et al. 2006; van Hattum
and Hoijtink 2008). For example, a company has one database on customers' purchasing habits
and another database on individuals' media viewing habits, and seeks to find associations between
viewing and purchasing habits (Gilula et al. 2006). This setting, known as data fusion (Rässler
2002, pp. 60-63), arises in other contexts, including microsimulation modeling in economics
(Moriarty and Scheuren 2003) and government statistics (D'Orazio et al. 2002). For applications in
other areas, see Kadane (2001, reprinted from a 1978 manuscript), Rodgers (1994), Moriarty and
Scheuren (2001), and D'Orazio et al. (2006).


Typical applications of data fusion rely on strong and unverifiable assumptions about the
relationships among the variables. To see this, consider fusion of two databases, $D_1$ and $D_2$, with
disjoint sets of individuals. Let A denote the set of variables common to both databases, such as
demographics; let B denote the set of variables unique to $D_1$; and let B' denote the set of
variables unique to $D_2$. Since {A, B, B'} are never observed simultaneously, the joint distribution
of {A, B, B'} is not identifiable based on ($D_1$, $D_2$) alone. Neither is the distribution of {B, B'},
either marginally or conditionally on A. Put another way, many possible specifications of the joint
distribution of {A, B, B'} may be consistent with the marginal distributions of {A, B} in $D_1$ and
{A, B'} in $D_2$. The data provide no information on which specifications to favor.

For data fusion to proceed, analysts must make some assumption about the joint distribution of
{A, B, B'}. The most common assumption is that the variables in B are conditionally independent
of those in B', given the variables in A (Kiesl and Rässler 2006; D'Orazio et al. 2006; Gilula
et al. 2006). For example, one assumes that every person with the same age, gender, occupation, race,
county of residence, etc., has the same probability of purchasing the product, regardless of their
media viewing habits. While this assumption could be reasonable in some contexts with rich A
variables, it also could be grossly incorrect. For example, in some demographic groups, people
who watch advertising infrequently may be less likely to purchase the product. When this is the
case, assuming conditional independence can result in inferences about {A, B, B'} that do not
accurately reflect the underlying relationships in the population.

To reduce reliance on conditional independence assumptions, analysts require some form of
auxiliary information. For example, analysts can use knowledge about the joint distribution of
{B, B'} from other sources to bound the joint distribution of {A, B, B'} (D'Orazio et al. 2006).
Another possibility is to mount a new data collection that provides information on unknown
features of the joint distribution of {A, B, B'}. Historically, such surveys have been untimely and
prohibitively expensive. However, in recent years technological advances have opened the door to
fielding rapid-response, low-cost surveys (Gilula and McCulloch 2013). Questions then arise as to
how analysts can leverage the information in such surveys for more accurate data fusion.

In this article, we propose a data fusion approach that allows analysts to incorporate auxiliary
information on arbitrary subsets of {A, B, B'} in which at least one variable from B and one from B'
are jointly observed. We refer to such auxiliary information as glue, since it serves to strengthen
the connection between B and B'. We present the approach for the common setting of all categorical
variables, although similar strategies could be used for numerical variables. The basic idea is to
collect or construct a dataset that represents the auxiliary information, append this dataset to the
concatenated file ($D_1$, $D_2$), and fit an imputation model to predict missing B in $D_2$ and missing
B' in $D_1$. As the engine for imputation, we use a Bayesian latent class model (Dunson and Xing 2009;
Si and Reiter 2013). Using simulation studies, we illustrate how to accommodate glue of various
sizes and on various variable subsets, and demonstrate the potential for glue to improve accuracy
relative to fusion procedures that assume conditional independence. We also discuss problems
that can arise when using glue from a non-representative sample, and propose methodology for
incorporating non-representative glue in data fusion. We apply the methodology in a data fusion
experiment in which we obtain glue from the internet polling company CivicScience, and use the
glue to fuse surveys fielded by the book publisher HarperCollins Publishers on author preferences
and author discovery tendencies.
The remainder of the article is organized as follows. In Section 2, we introduce the
HarperCollins data fusion context and review typical approaches to data fusion in the literature. In
Section 3, we describe how to adapt Bayesian latent class models for data fusion to accommodate glue.
The approach allows for both the creation of completed data files, i.e., as in multiple imputation
(Rubin 1986, 1987; Reiter 2012), as well as parameter inference. We focus on creating completed
datasets, which can be subsequently analyzed using the techniques of Rubin (1987). We also
summarize results of simulation studies that demonstrate the benefits of leveraging glue in data fusion.
In Section 4, we present results of the HarperCollins Publishers and CivicScience data fusion. In
Section 5, we conclude with a discussion of open questions and future research directions.


2 Background 

2.1 HarperCollins data and CivicScience glue

HarperCollins Publishers routinely administers surveys to the public to learn about their behaviors
and opinions, relying on this information to guide business decisions. The surveys typically include
questions about basic demographics (e.g., age, income, gender) and reading habits, as well as
questions on focused topics such as technology usage or author preferences. Generally, around
10% of questions in the surveys address basic demographics and reading habits, and the remaining
90% are specific to the survey. We seek to fuse data from two HarperCollins surveys, one including
questions on the authors people read and the other including questions on where people discover
new authors (e.g., Facebook and Best Sellers lists). The first survey comprises 4,001 respondents
and 734 variables; we use only a subset of questions related to discovery and demographics. The
second survey comprises 5,015 respondents and 1,433 variables; we use only a subset of questions
related to author readership and demographics. The surveys were administered by an independent
company to a random sample of people residing in the United States, with pre-specified numbers
of individuals in specific categories based on age, gender, ethnicity, and geographic region.

HarperCollins is interested in understanding the demographics of readers of particular authors
and how to reach them. For example, if HarperCollins publishes a new book by the author Lisa
Kleypas, will they reach more of her readers by advertising the new book in bookstores or on
Facebook? Furthermore, who should be the target audience (age, gender, etc.) of the advertisements?
Leveraging the connections between author readership, book discovery, and demographics across
surveys can help HarperCollins pursue profitable marketing strategies.

To obtain glue for the data fusion, we collaborated with the internet polling company
CivicScience.¹ Internet polling companies are potentially ideal glue collectors, as they are able to survey
thousands of people daily at low cost. As a case in point, CivicScience collects hundreds of
thousands of responses per day and has information stored on millions of respondents. CivicScience is
routinely paid by other companies to canvass the public on marketing and business decisions.

CivicScience obtains information by posting short surveys, typically three or four questions, on 
the sidebar of popular websites. Participation is purely voluntary (raising the potential for selection 
bias, which we return to later). CivicScience entices participation by beginning each survey with 
an engagement question that people are often willing and eager to share their opinion on (e.g., 
“Who will win the Superbowl?”). The next question or questions are value questions asked on behalf of a
paying client. The final question inquires about respondent demographics. After completing the
short survey, participants are offered the option to answer additional questions. CivicScience uses 
participants’ computer IP addresses to link responses from the same individuals (more accurately, 
from the same computer). 

For our application, CivicScience ran numerous three-question surveys on author readership 
and discovery. The second question was about either author readership or discovery, and the third 
question was about either the respondent’s age or gender. Many participants completed more than 
one survey, allowing CivicScience to link responses on author readership, discovery, age, and 
gender. We use these linked data in the fusion of the HarperCollins surveys. 

¹ Mark Cuban, the high-profile owner of the Dallas Mavericks and Shark Tank investor, was quoted in the Pittsburgh
Tribune Review in 2013 stating, “CivicScience is one of the most exciting companies I have seen in a long time. Their
ability to predict consumer behavior in media, retail sales, and even politics has virtually unlimited potential.”
2.2 Common data fusion methods

The most widely used data fusion technique in practice is statistical matching (van der Putten et al.
2002; Wicken and Elms 2009). The analyst divides the observations in ($D_1$, $D_2$) into groups based
on the similarity of values in the A variables. Within each group, the analyst imputes missing B
values for records in $D_2$ by sampling from the empirical distribution of B among the $D_1$ records in
that group. The analyst imputes missing B' values for records in $D_1$ in a similar manner. Often one
cannot find groups of records in $D_1$ and $D_2$ with exactly the same values on all of A, particularly
when the contingency table implied by the variables in A has a large number of cells. In such cases,
analysts form groups based on some subset of A variables. Alternatively, analysts specify some distance
function that quantifies how “close” the A values are for a given pair of observations from $D_1$ and
$D_2$, and form groups based on the close matches. Regardless of how the analyst forms groups,
these approaches all make the unverifiable assumption that B is independent of B' within the
analyst-specified groups.
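The matching procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used by any of the cited authors: the function name `statistical_match`, the dictionary-based record layout, and the toy variable names are hypothetical, and groups are formed by exact agreement on all A variables.

```python
import random
from collections import defaultdict


def statistical_match(d1, d2, a_keys, b_key, seed=0):
    """Impute b_key for each record in d2 by drawing a donor value from the
    d1 records that share exactly the same values on a_keys."""
    rng = random.Random(seed)
    donors = defaultdict(list)
    for rec in d1:
        donors[tuple(rec[k] for k in a_keys)].append(rec[b_key])

    fused = []
    for rec in d2:
        pool = donors.get(tuple(rec[k] for k in a_keys))
        new = dict(rec)
        # None marks records with no exact donor group; an analyst would then
        # coarsen A or fall back to a distance-based match.
        new[b_key] = rng.choice(pool) if pool else None
        fused.append(new)
    return fused


# Toy data (hypothetical): A = (gender, age), B = ebook, B' = hours.
d1 = [
    {"gender": 1, "age": 2, "ebook": 1},
    {"gender": 1, "age": 2, "ebook": 2},
    {"gender": 2, "age": 3, "ebook": 2},
]
d2 = [
    {"gender": 1, "age": 2, "hours": 3},
    {"gender": 2, "age": 3, "hours": 1},
    {"gender": 2, "age": 6, "hours": 2},
]
fused = statistical_match(d1, d2, ["gender", "age"], "ebook")
```

Because each imputed B value is drawn without reference to the recipient's B' value, the sketch makes the conditional independence assumption explicit: within a group, the donor draw cannot depend on B'.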

A second approach to data fusion is to estimate regression models for the distributions of
(B | A) from $D_1$ and (B' | A) from $D_2$, and set f(B, B' | A) = f(B | A) f(B' | A), i.e.,
assume conditional independence between B and B' (Rodgers 1994; Gilula et al. 2006). One
then imputes missing values of B using the estimated model for (B | A), and imputes missing
values of B' using the estimated model for (B' | A). Gilula et al. (2006) describe how to adapt this
regression-based approach to incorporate auxiliary information about the dependence between a
single binary B and a single binary B'.

A third approach is to estimate models for the entire joint distribution of {A, B, B'}. For
example, one could use a multinomial distribution with probabilities constrained by a log-linear model
that excludes terms involving interactions between B and B'. This also assumes conditional
independence between B and B'. D'Orazio et al. (2006) describe how this conditional independence
assumption can be relaxed in log-linear models by incorporating auxiliary information on marginal
probabilities for {B, B'}. Alternatively, one could estimate the joint distribution of {A, B, B'} with
a latent class model (Goodman 1974), as suggested by Kamakura and Wedel (1997) and as we do
here. Unlike log-linear models, latent class models can capture complex associations among the
variables automatically, avoiding the difficult task of deciding which interactions to include from
the enormous space of possible models (Vermunt et al. 2008; Si and Reiter 2013). Latent class
models also easily handle missing values in $D_1$ and $D_2$ due to item nonresponse within the surveys,
assuming nonresponse is missing at random (Rubin 1976). However, we are not aware of
methodology for incorporating auxiliary information when using latent class models in data fusion. We
now introduce such methodology.


3 Methodology 

3.1 Bayesian latent class models for categorical data fusion 

Suppose that we seek to fuse database $D_1$ comprising $n_1$ individuals with database $D_2$ comprising
$n_2$ individuals. Let $Y_{ij} \in \{1, \ldots, d_j\}$ be the value of variable j for individual i, where
$j = 1, \ldots, p$ and $i = 1, \ldots, n_1 + n_2$. Let $Y_i = (Y_{i1}, \ldots, Y_{ip})$ for all i. The p variables
form a contingency table with $\prod_{j=1}^{p} d_j$ cells. For variables $j \in A$, we observe $Y_{ij}$ for all
$n = n_1 + n_2$ individuals; for variables $j \in B$, we observe $Y_{ij}$ for only the $n_1$ individuals in $D_1$;
and, for variables $j \in B'$, we observe $Y_{ij}$ for only the $n_2$ individuals in $D_2$. We note that, in
practice, item nonresponse will result in unintentionally missing values within $D_1$ and $D_2$ as well.

In latent class models for categorical data, we assume that each individual is a member of one
of N unobserved classes. Let $Z_i \in \{1, \ldots, N\}$ denote individual i's class membership, and let
$\pi_l = P(Z_i = l)$ be the probability that individual i is in class l. We assume that
$\pi = (\pi_1, \ldots, \pi_N)$ is the same for all individuals. Within each class, we assume the variables
follow independent categorical distributions with variable-specific probabilities
$\phi_l^{(j)} = (\phi_{l1}^{(j)}, \ldots, \phi_{l d_j}^{(j)})$, where $\phi_{ly}^{(j)} = P(Y_{ij} = y \mid Z_i = l)$.
As a flexible and computationally convenient prior distribution on $\pi$ and $\{\phi_l^{(j)}\}$, we use
the truncated version of the Dirichlet Process (DP) prior (Sethuraman 1994). The
complete model, referred to as the DP mixture of products of multinomials (DPMPM), can be
expressed as:

\[
\begin{aligned}
Y_{i1}, \ldots, Y_{ip} \mid Z_i, \phi &\sim \prod_{j=1}^{p} \text{categorical}\big(Y_{ij}; \phi_{Z_i 1}^{(j)}, \ldots, \phi_{Z_i d_j}^{(j)}\big), \quad i = 1, \ldots, n && (1) \\
Z_i \mid \pi &\sim \text{categorical}(\pi_1, \ldots, \pi_N), \quad i = 1, \ldots, n && (2) \\
\pi_l &= V_l \prod_{r=1}^{l-1} (1 - V_r), \quad l = 1, \ldots, N-1; \qquad \pi_N = 1 - \sum_{l=1}^{N-1} \pi_l \\
V_l \mid \alpha &\sim \text{beta}(1, \alpha), \quad l = 1, \ldots, N-1; \qquad V_N = 1 \\
\phi_l^{(j)} &\sim \text{Dir}\big(a_1^{(j)}, \ldots, a_{d_j}^{(j)}\big), \quad l = 1, \ldots, N, \; j = 1, \ldots, p \\
\alpha &\sim \text{gamma}(a_\alpha, b_\alpha). && (3)
\end{aligned}
\]


The parameter $\alpha$ plays a central role in determining the number of effective components in the
mixture, with smaller values favoring fewer components. A hyperprior on $\alpha$ allows the data to
inform the number of components. In our applications, we fix $a_\alpha$ and $b_\alpha$ equal to 0.5 in the prior
distribution in (3), which represents a relatively noninformative prior. We set
$a_1^{(j)} = \cdots = a_{d_j}^{(j)} = 1$ for all j.
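To make the role of $\alpha$ concrete, the following Python sketch draws class probabilities from a truncated stick-breaking construction of the kind written in the model above. It is an illustration only: the function name `stick_breaking_weights` and the chosen values of `alpha` are assumptions for the example, and the code is not the sampler used in the paper.

```python
import random


def stick_breaking_weights(alpha, n_classes, rng):
    """Draw (pi_1, ..., pi_N) with V_l ~ beta(1, alpha) for l < N and V_N = 1."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a class
    for _ in range(n_classes - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)  # pi_l = V_l * prod_{r<l} (1 - V_r)
        remaining *= 1.0 - v
    weights.append(remaining)  # pi_N takes whatever is left
    return weights


rng = random.Random(1)
pi_small = stick_breaking_weights(alpha=0.5, n_classes=30, rng=rng)
pi_large = stick_breaking_weights(alpha=10.0, n_classes=30, rng=rng)

# Number of classes carrying at least 1% of the probability mass in each draw.
n_big_small = sum(w > 0.01 for w in pi_small)
n_big_large = sum(w > 0.01 for w in pi_large)
```

With small `alpha`, each $V_l$ tends to be close to one, so the first few classes absorb most of the mass; larger `alpha` spreads mass over more classes. Comparing `n_big_small` and `n_big_large` over repeated draws is one informal way to see this tendency.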

We estimate the DPMPM model using Markov chain Monte Carlo (MCMC) posterior
simulation techniques (Ishwaran and Zarepour 2000; Ishwaran and James 2001). The missing $Y_{ij}$,
both those arising unintentionally from item nonresponse and those missing by design due to the
structure of data fusion, are imputed as part of the MCMC. Given a draw of the model parameters
$(\alpha, Z, V, \pi)$ and the class-specific probabilities $\phi$, we sample a value
for each missing $Y_{ij}$ from the relevant independent categorical distribution in class $Z_i$. Further
details on the sampling algorithm are provided in the Appendix.

The probability model defined in (1) and (2) is the same as that used by Kamakura and Wedel
(1997). However, rather than use a fully Bayesian estimation approach, they maximize the
likelihood function obtained from equations (1) and (2). Additionally, Kamakura and Wedel (1997) use
heuristics to determine some optimal number of classes, whereas with the DPMPM one simply can
fix the truncation level to a large value (Ishwaran and James 2001). To ensure that N is large
enough, the analyst confirms that the number of occupied classes $n^*$ is always substantially less
than N across MCMC samples. If the posterior distribution for $n^*$ places significant mass near N,
then N should be increased. In the analyses in this article, N = 30 is always sufficiently large.

Even though variables are independent within the latent classes, variables still can be marginally
dependent across the set of classes. For example, for any pair of variables j and j', we have

\[
P(Y_{ij} = y, Y_{ij'} = y' \mid \pi, \phi) = \sum_{l=1}^{N} \pi_l \, \phi_{ly}^{(j)} \, \phi_{ly'}^{(j')}. \tag{4}
\]

In general, the expression in (4) is not identical to the product of the two marginal probabilities,
$\big(\sum_{l=1}^{N} \pi_l \phi_{ly}^{(j)}\big)\big(\sum_{l=1}^{N} \pi_l \phi_{ly'}^{(j')}\big)$, implying $Y_{ij}$ and $Y_{ij'}$ are
independent conditional on $Z_i$ and $\phi$, but dependent upon marginalization over $Z_i$. Expression (4)
can be used for model-based inferences about probabilities.
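A small numerical check illustrates how within-class independence can still produce marginal dependence. The two-class parameter values below are invented for illustration, levels are indexed from zero for convenience, and the helper names `joint_prob` and `marginal_prob` are hypothetical.

```python
def joint_prob(pi, phi_j, phi_jp, y, yp):
    """Mixture joint probability: sum over classes l of pi_l * phi_j[l][y] * phi_jp[l][yp]."""
    return sum(p * a[y] * b[yp] for p, a, b in zip(pi, phi_j, phi_jp))


def marginal_prob(pi, phi_j, y):
    """Marginal probability: sum over classes l of pi_l * phi_j[l][y]."""
    return sum(p * a[y] for p, a in zip(pi, phi_j))


# Two latent classes and two binary variables (illustrative values only).
pi = [0.5, 0.5]
phi_j = [[0.9, 0.1], [0.2, 0.8]]   # phi_j[l][y] for variable j
phi_jp = [[0.8, 0.2], [0.1, 0.9]]  # phi_jp[l][y'] for variable j'

joint = joint_prob(pi, phi_j, phi_jp, 0, 0)
product = marginal_prob(pi, phi_j, 0) * marginal_prob(pi, phi_jp, 0)
```

Here `joint` equals 0.5 × 0.72 + 0.5 × 0.02 = 0.37, whereas `product` equals 0.55 × 0.45 = 0.2475, so the two variables are positively associated after marginalizing over class membership even though they are independent within each class.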


As suggested by Gilula et al. (2006) when discussing the model used by Kamakura and Wedel
(1997), estimates of the joint distribution of {A, B, B'} from latent class models may not be
concordant with conditional independence. In our simulations, we found that the DPMPM favors
somewhat stronger correlation between B and B' than is implied under conditional independence.
This results from the clustering engendered by the DP prior specification, since the data contain
no information about {B, B'} jointly. This finding underscores the potential benefits of using glue
when using latent class models for data fusion.


3.2 Incorporating glue in data fusion 


Schifeling and Reiter (2015) developed a strategy for incorporating prior information about marginal
probabilities into the DPMPM. They suggest constructing a hypothetical dataset that represents
prior beliefs, appending it to the collected data, and estimating the latent class model with the
concatenated real and hypothetical data. As an example, if one knows only that the true proportion
of women in a population is exactly 50%, one can append a large hypothetical dataset with equal
numbers of men and women with all other variables missing. Schifeling and Reiter (2015) show
that this approach fixes the posterior probability of being female at 50% without distorting the
conditional distributions of other variables on gender.

We adapt this strategy to incorporate glue in data fusion. We assume that the analyst has glue
data, $D_s$, in which some subset of the {B, B'} variables, possibly with A, is measured. For
individuals $i = 1, \ldots, n_s$ in $D_s$, let $Y_i$ be the p x 1 vector of measurements for the ith individual. In
most data fusion scenarios, each $Y_i$ will be incomplete by design, in that only some variables are
available in $D_s$. We assume that $Y_i$ for individuals in $D_s$ follows the model in (1)-(3). Thus, we
concatenate ($D_1$, $D_2$, $D_s$) in one file, and estimate the DPMPM model using MCMC. The
information on {A, B, B'} available in $D_s$ influences the parameter estimates, resulting in imputations
of missing B variables in $D_2$ and B' variables in $D_1$ that reflect the dependence relationships in
the glue. For computational convenience, when fitting the MCMC we impute missing values in $D_1$
and $D_2$, but not those in $D_s$.
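The concatenation step can be pictured as stacking three files with different missingness patterns. The sketch below is a minimal illustration, assuming records are stored as dictionaries and that `None` marks a missing value; the function name `concatenate_for_fusion`, the `source` flag, and the toy variables are hypothetical conveniences, not part of the method's specification.

```python
def concatenate_for_fusion(d1, d2, ds, variables):
    """Stack D1, D2 and the glue Ds into one file; unobserved variables become None."""
    rows = []
    for source, data in (("D1", d1), ("D2", d2), ("Ds", ds)):
        for rec in data:
            row = {v: rec.get(v) for v in variables}  # None = missing
            row["source"] = source  # lets a sampler skip imputation for glue rows
            rows.append(row)
    return rows


variables = ["gender", "age", "ebook", "hours"]   # A = gender, age; B = ebook; B' = hours
d1 = [{"gender": 1, "age": 2, "ebook": 1}]        # B observed, B' missing by design
d2 = [{"gender": 2, "age": 5, "hours": 3}]        # B' observed, B missing by design
ds = [{"age": 4, "ebook": 2, "hours": 1}]         # glue on {age, ebook, hours}
stacked = concatenate_for_fusion(d1, d2, ds, variables)
```

Only the glue rows contain B and B' together, which is why they are the sole direct source of information on the dependence between the two.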

The ideal glue includes data on all variables in {A, B, B'} and is a sample from the distribution
of {A, B, B'} in the population of interest. In practice, glue may be available only on subsets
of variables, such as {B, B'}. In addition, $D_s$ may not be representative of the population. For
example, in the HarperCollins and CivicScience data fusion, only the conditional distributions
P(B | A, B') can be plausibly considered representative.

To investigate the potential benefits of glue in these scenarios, we use three sets of simulation
studies. First, we add glue on different subsets of variables to explore the intuition that richer glue
(i.e., glue that contains more variables simultaneously observed) results in larger improvements in
inference. Second, we analyze the sensitivity of inference to the addition of varying numbers of
data subjects in the glue. Third, we study the validity of inferences when using glue that is not
representative of the population distribution of {A, B, B'}. We also present a method for
appropriately incorporating such information. We note that each of these issues arises when using the
CivicScience data as glue.
Table 1: Variables contained in the HarperCollins survey used for simulations.

Variable      Group   Levels (Level Label)
gender        A       male (1), female (2)
age           A       18-24 (1), 25-34 (2), 35-44 (3), 45-54 (4), 55-64 (5), 65+ (6)
work status   A       emp FT (1), emp PT (2), homemaker (3), retired (4), self-employed (5), other (6)
income        A       <25K (1), 25-45K (2), 45-75K (3), 75-99K (4), 100+K (5), won't say (6)
eBook         B       yes (1), no (2)
hours         B'      <1 (1), 1-4 (2), 5+ (3)


3.3 Simulation studies with representative glue 


We simulate fusion settings using a third HarperCollins survey containing 4,000 respondents and
1,056 variables. As the A variables, we select demographics including gender, age, work status,
and income. As the B and B' variables, we select eBook reader ownership and number of hours
spent reading per week, respectively. Table 1 describes the variables in detail. We create $D_1$ by
randomly selecting half of the 3,567 complete cases and removing reading hours, and create $D_2$
as the remaining half of the complete case data with eBook reader ownership removed. We are
interested in fusing $D_1$ and $D_2$ to estimate the relationship between eBook reader ownership and
reading hours per week, conditional on specific demographic variables. Because we have the
complete observations of {A, B, B'} in the original data, we can compare results from data fusion
to the ground truth.


To quantify the potential for glue in this example, we investigated the Fréchet bounds (D'Orazio
et al. 2006) on P(B = j, B' = k) for j = 1, 2 and k = 1, 2, 3, as implied by the marginal
distributions P(A, B) and P(A, B'). If these bounds are tight, signifying the probabilities are
highly constrained by the observed marginal probabilities P(A, B) and P(A, B'), then little is to
be gained from incorporating glue. Conversely, if the bounds on the cell probabilities of P(B, B')
are wide, glue has the potential to greatly improve inferences based on P(B, B'). Note that the
marginal distributions P(B) and P(B') themselves constrain P(B, B'). The Fréchet bound widths
on the six cell probabilities ranged from 0.163 to 0.169. This implies that even with observing
{A, B} and {A, B'} there remains substantial uncertainty about {B, B'}, and potentially much to be
gained from collecting glue.
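As a rough illustration of how such bounds can be computed, the sketch below applies the standard Fréchet inequalities within each cell of A and sums over cells. This is an assumed implementation of the usual conditional bounds, not necessarily the exact computation behind the widths reported above; the function name `frechet_bounds` and the toy probabilities are hypothetical.

```python
def frechet_bounds(p_ab, p_abp, j, k):
    """Bounds on P(B = j, B' = k) implied by P(A, B) and P(A, B').

    p_ab[a][j]  = P(A = a, B = j)
    p_abp[a][k] = P(A = a, B' = k)
    """
    lower = 0.0
    upper = 0.0
    for a in p_ab:
        p_a = sum(p_ab[a].values())  # P(A = a)
        # Within A = a: max(0, P(a, j) + P(a, k) - P(a)) <= P(a, j, k) <= min(P(a, j), P(a, k))
        lower += max(0.0, p_ab[a][j] + p_abp[a][k] - p_a)
        upper += min(p_ab[a][j], p_abp[a][k])
    return lower, upper


# Toy margins with a binary A (illustrative values only).
p_ab = {1: {1: 0.10, 2: 0.40}, 2: {1: 0.30, 2: 0.20}}
p_abp = {1: {1: 0.20, 2: 0.30}, 2: {1: 0.25, 2: 0.25}}
low, high = frechet_bounds(p_ab, p_abp, j=1, k=1)
width = high - low
```

In this toy case the bounds are 0.05 and 0.35, a width of 0.30; a wide interval of this kind signals that the observed margins leave the joint cell probability largely undetermined.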
3.3.1 Glue richness

We consider four types of glue for $D_s$. In increasing order of richness, these include only the
marginal distribution of {B, B'}, the joint distribution of {$A_g$, B, B'} where $A_g$ represents
gender, the joint distribution of {$A_a$, B, B'} where $A_a$ represents age, and the joint distribution of
{$A_g$, $A_a$, B, B'}. In each case, we create glue by duplicating the appropriate variables for all
respondents in the original survey; thus, $n_s$ = 3567. We run the MCMC chains long enough to
obtain 120,000 posterior samples of all parameters. From these runs, we sample m = 50
completed versions of ($D_1$, $D_2$), which we use in multiple imputation inferences.

To evaluate the impact of glue richness, we compare Hellinger distances, which are commonly
used to quantify the similarity between two probability distributions (Pollard 2002; Gibbs and Su
2002). Hellinger distances based on {A, B, B'} reflect the accuracy of the entire estimated joint
distribution P(A, B, B'), which arguably is the most important level of validity a fusion process
can achieve (Rässler 2004). For two discrete distributions P and Q taking on k values with
probabilities $(p_1, \ldots, p_k)$ and $(q_1, \ldots, q_k)$, the Hellinger distance is given by

\[
H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2}.
\]

This quantity is between zero and one, where smaller values imply more similarity between the
distributions. Because the richest type of glue contains observations on {$A_g$, $A_a$, B, B'}, we
compute Hellinger distances between the empirical distribution of {$A_g$, $A_a$, B, B'} based on the original
complete survey and the corresponding posterior inferences. Calculations of distances based on the
joint distribution {A, B, B'} including all demographic variables, rather than just {$A_g$, $A_a$, B, B'},
yield similar patterns.
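For readers who wish to reproduce this kind of comparison, a minimal Python sketch of the Hellinger distance between two discrete distributions follows. It uses the normalization that bounds the distance between zero and one, consistent with the description above; the function name `hellinger` and the example probability vectors are illustrative assumptions.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as
    equal-length sequences of cell probabilities."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of cells")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


identical = hellinger([0.5, 0.5], [0.5, 0.5])   # same distribution
disjoint = hellinger([1.0, 0.0], [0.0, 1.0])    # no shared support
close = hellinger([0.25, 0.25, 0.5], [0.2, 0.3, 0.5])
```

In the fusion setting, `p` would hold the flattened cell probabilities of the empirical contingency table and `q` the cell probabilities implied by one posterior draw, so that the distance can be summarized across draws.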

Table 2 displays the posterior means and 95% credible intervals for the Hellinger distances
between the empirical distribution of {$A_g$, $A_a$, B, B'} and the corresponding posterior estimates.
The results indicate that using glue can yield substantial gains in accuracy, with increasing gains
with richer glue. These results also suggest that gender offers smaller gains than age, a consequence
of the fact that the distribution of {B, B'} is more similar across gender than age. This finding is
evident in all of the evaluations that follow. Table 2 also displays results from a set of fused data
files using an exact matching algorithm based on all variables in A. The empirical joint probability
distribution is comparable to that produced from the latent class model with no glue.

We also compare the sum of the absolute differences between the counts in the true contingency
table for {$A_g$, $A_a$, B, B'} based on the original complete data file and those based on imputed
complete data files. These sums, when divided by two, indicate how many individuals the model
places in incorrect cells of the empirical contingency table. We approximate the expected number
of “misclassified” individuals in an imputed data set with the empirical average over 50 imputed
data files. Mathematically, the approximation for the expected number of misclassified individuals
can be expressed as

\[
M \approx \frac{1}{50} \sum_{m=1}^{50} \left( 0.5 \sum_{j} \left| n_j^{(m)} - n_j \right| \right),
\]

where the inner sum runs over all cells j of the contingency table, $n_j^{(m)}$ is the number of
individuals in cell j in the mth imputed data set, and $n_j$ is the true number of individuals in cell j
of the original complete data set. Table 3 shows similar patterns as Table 2:
using glue improves over approaches that assume conditional independence, with increasing gains
as the glue becomes richer. We note that adding gender information to glue already containing age
does not lead to much improvement in imputation accuracy.
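The misclassification measure described above can be computed directly from cell counts. The following Python sketch is a minimal illustration; the function name `expected_misclassified` and the toy counts are assumptions, and each table is represented as a flat list of cell counts in a fixed cell order.

```python
def expected_misclassified(true_counts, imputed_counts_list):
    """Average, over imputed data sets, of half the summed absolute
    differences between imputed and true cell counts."""
    per_dataset = [
        0.5 * sum(abs(n_m - n) for n_m, n in zip(counts, true_counts))
        for counts in imputed_counts_list
    ]
    return sum(per_dataset) / len(per_dataset)


# Toy example with three cells and two imputed data sets.
true_counts = [10, 20, 30]
imputed = [
    [12, 18, 30],  # two individuals moved from cell 2 to cell 1
    [10, 24, 26],  # four individuals moved from cell 3 to cell 2
]
m_hat = expected_misclassified(true_counts, imputed)
```

The factor of one half reflects that each misplaced individual contributes twice to the absolute differences: once in the cell they left and once in the cell they entered. In the toy example the two data sets misplace 2 and 4 individuals, giving an average of 3.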

As a more focused evaluation, we use the completed datasets to estimate a logistic regression
of eBook reader ownership on reading hours and the demographic variables. The model includes
terms for all main effects for all predictors, pairwise interactions between reading hours and gender
and reading hours and age, and the three-way interaction among reading hours, gender, and age.
Letting $A_i$ represent income and $A_w$ represent work status, the model can be expressed as

\[
\begin{aligned}
\text{logit}(P(B = 1)) = {} & \beta_0 + \beta^{(g)} 1(A_g = 2) + \sum_{k=2}^{6} \beta_k^{(a)} 1(A_a = k) + \sum_{k=2}^{6} \beta_k^{(w)} 1(A_w = k) \\
& + \sum_{k=2}^{6} \beta_k^{(i)} 1(A_i = k) + \sum_{k=2}^{3} \beta_k^{(h)} 1(B' = k) + \beta^{(gh)} 1(A_g = 2, B' = 3) \\
& + \beta^{(ah)} 1(A_a = 6, B' = 3) + \beta^{(gah)} 1(A_g = 2, A_a = 6, B' = 3).
\end{aligned}
\]
Table 2: Posterior mean and 95% credible intervals for the Hellinger distance between the true
and estimated probability table for {$A_g$, $A_a$, B, B'} under five different glue scenarios, as well as
the estimate obtained from a fused data set under statistical matching. *The range of Hellinger
distances across 10 perfect matchings is reported to quantify matching uncertainty.

Glue                       mean    95% CI or range*
No glue                    .104    (.094, .113)
{B, B'}                    .083    (.075, .091)
{B, B', A_g}               .077    (.071, .084)
{B, B', A_a}               .060    (.053, .068)
{B, B', A_g, A_a}          .052    (.047, .059)
Exact matching             .100    .090-.107


We estimate the coefficients from the 50 completed data sets using the standard multiple
imputation combining rules (Rubin 1987). As displayed in Figure 1, 18 of the 22 regression coefficients
based on the original data are contained in the 95% MI confidence intervals under the data fusion
model applied with no glue. All intervals contain the original data coefficients when glue includes
{$A_a$, B, B'} as well as {$A_g$, $A_a$, B, B'}. Adding glue with only {B, B'} improves the estimates of
the main effects associated with B' (reading hours). Adding glue with at least {$A_a$, B, B'} results
in further improvements, in particular resulting in more reliable estimates of the interaction term
associated with $A_a$ x B' (age x hours). Clearly, even targeted inferences can be improved by
collecting glue, with generally increasing gains with richer glue.
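The combining rules referenced above can be sketched for a single scalar coefficient as follows. This is a minimal illustration of the standard multiple imputation formulas (point estimate, total variance, and degrees of freedom); the function name `rubin_combine` and the toy inputs are assumptions, and the code does not reproduce the paper's analysis.

```python
def rubin_combine(estimates, variances):
    """Combine m completed-data estimates of one scalar quantity.

    estimates: point estimates from each completed data set
    variances: corresponding squared standard errors
    Returns (pooled estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates with matching variances")
    q_bar = sum(estimates) / m                               # pooled point estimate
    u_bar = sum(variances) / m                               # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    total = u_bar + (1.0 + 1.0 / m) * b
    if b > 0:
        df = (m - 1) * (1.0 + u_bar / ((1.0 + 1.0 / m) * b)) ** 2
    else:
        df = float("inf")  # no between-imputation variability
    return q_bar, total, df


# Toy inputs for one coefficient across three completed data sets.
q_bar, total_var, df = rubin_combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

A 95% interval would then be formed as `q_bar` plus or minus the appropriate t quantile on `df` degrees of freedom times the square root of `total_var`; the quantile lookup is omitted here to keep the sketch free of external dependencies.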


3.3.2 Glue size 


In Section 3.3.1, the glue sample size was equal to the total survey sample size, that is, $n_s$ = n =
3567. Generally, this will not be the case. To evaluate the role of glue sample size, we repeated the
simulations using {$A_g$, $A_a$, B, B'} as glue with different sample sizes for $D_s$. As shown in Table
4, as expected, more high-quality glue observations result in more accurate estimates with less
uncertainty. Data fusion with $n_s$ = 1784 glue cases yields inferences that are close to the ground
truth and to the inferences produced with more glue cases, suggesting that even modest amounts
of glue can improve inferences.
[Figure 1 panels: "no glue"; "glue on {ebook, hours}"; "glue on {gender, ebook, hours}";
"glue on {age, ebook, hours}"; "glue on {gender, age, ebook, hours}". Axis label in each panel:
true regression coefficients.]

Figure 1: Point estimates and 95% confidence intervals for estimated versus true regression
coefficients under five different glue scenarios. The first plot refers to the no glue scenario, and
highlights terms which are affected by adding glue. These same 4 terms are highlighted in the
remaining plots as more glue is added.
Table 3: Average number of individuals in the incorrect cell of the contingency table across the
complete data sets under five different glue scenarios and under statistical matching. Ten complete
data sets were considered for the statistical matching procedure.

Glue                       Average number misclassified
No glue                    318.5
{B, B'}                    250.5
{B, B', A_g}               247.0
{B, B', A_a}               199.5
{B, B', A_g, A_a}          196.0
Exact matching             315.0

Table 4: Posterior mean and width of 95% credible intervals (in parentheses) for the marginal
bivariate distribution of P(B, B') under three different glue sample sizes.

                     truth   n_g = 0        n_g = 1784     n_g = 7135
P(B = 1, B' = 1)     .037    .077 (.021)    .042 (.018)    .040 (.009)
P(B = 2, B' = 1)     .363    .333 (.041)    .357 (.033)    .362 (.020)
P(B = 1, B' = 2)     .064    .067 (.016)    .072 (.019)    .066 (.011)
P(B = 2, B' = 2)     .252    .248 (.036)    .247 (.030)    .251 (.019)
P(B = 1, B' = 3)     .096    .062 (.017)    .089 (.020)    .093 (.012)
P(B = 2, B' = 3)     .186    .213 (.036)    .192 (.027)    .188 (.017)
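
As a rough numerical summary of Table 4, the sketch below averages the absolute error of the posterior means and the interval widths over the six cells for each glue sample size. The inputs are transcribed from the table; the two summary measures and the names `truth`, `posterior_means`, `interval_widths`, and `summarize` are illustrative additions, not quantities reported in the analysis.

```python
# Values transcribed from Table 4. The summary measures below are an
# illustrative addition, not quantities reported in the paper.
truth = [0.037, 0.363, 0.064, 0.252, 0.096, 0.186]

# Posterior means and 95% interval widths, keyed by glue sample size n_g.
posterior_means = {
    0: [0.077, 0.333, 0.067, 0.248, 0.062, 0.213],
    1784: [0.042, 0.357, 0.072, 0.247, 0.089, 0.192],
    7135: [0.040, 0.362, 0.066, 0.251, 0.093, 0.188],
}
interval_widths = {
    0: [0.021, 0.041, 0.016, 0.036, 0.017, 0.036],
    1784: [0.018, 0.033, 0.019, 0.030, 0.020, 0.027],
    7135: [0.009, 0.020, 0.011, 0.019, 0.012, 0.017],
}


def summarize():
    """Return {n_g: (mean absolute error, mean interval width)} over the six cells."""
    summary = {}
    for n_g, means in posterior_means.items():
        mae = sum(abs(m - t) for m, t in zip(means, truth)) / len(truth)
        mean_width = sum(interval_widths[n_g]) / len(truth)
        summary[n_g] = (mae, mean_width)
    return summary


for n_g, (mae, mean_width) in summarize().items():
    print(f"n_g = {n_g}: mean abs. error = {mae:.4f}, mean width = {mean_width:.4f}")
```

Under this summary, the mean absolute error falls from about 0.023 with no glue to about 0.006 with n_g = 1784 and 0.002 with n_g = 7135, and the average interval width shrinks as glue is added.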


3.3.3 Nonrepresentative glue 


While glue obtained from non-probability samples like CivicScience polls is convenient and
inexpensive, it generally is not representative of the joint distribution of {A, B, B'} in the target
population for (D_1, D_2). For example, D_g may disproportionately represent some demographic
groups compared to their shares in (D_1, D_2). When the concatenated data (D_1, D_2, D_g) is not a
(incomplete) draw from P(A, B, B'), the posterior distributions of the DPMPM model parameters
will not produce accurate estimates of P(A, B, B'). The resulting imputations will be draws from
a biased estimate of P(A, B, B'), which can diminish or even negate the benefits of using glue. In
various simulations, not reported here to save space, we found that significant problems can arise
when appending nonrepresentative glue, even when the glue is representative of the population in
terms of P(B, B' | A) but not representative in terms of A.

When D_g is not representative of the population, one still can construct useful glue provided
that either P(B | B', A) or P(B' | B, A) in D_g is a draw from the corresponding conditional
distribution in the population. The analysis proceeds as follows.


1. Fit the DPMPM model to D_g alone to estimate P(A, B, B'), from which one can obtain
P(B | A, B') and P(B' | A, B).

2. Construct glue D* by duplicating or sampling records {A, B} with replacement from D_1,
or duplicating or sampling records {A, B'} with replacement from D_2, and imputing the
missing values of B' from {B' | A, B} and the missing values of B from {B | A, B'} based on
the conditional distributions from step (1).

In this way, the constructed glue appropriately reflects the marginal distribution of A and the
information in the conditional distributions. With glue representing the appropriate joint distribution,
we are in the scenarios described in Section 3.3.1 and Section 3.3.2.
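
The two steps above can be sketched in a few lines. This is a simplified illustration only: in place of the DPMPM fit of step (1) it uses raw empirical frequencies of B within each (A, B') cell of the collected glue, which provides none of the smoothing across cells that the mixture model gives. The function names, the toy records, and the single categorical A are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_conditional_b(glue_records):
    """Stand-in for step (1): empirical P(B | A, B') from records (a, b, b_prime)."""
    counts = defaultdict(Counter)
    for a, b, b_prime in glue_records:
        counts[(a, b_prime)][b] += 1
    return {
        cell: {b: n / sum(counter.values()) for b, n in counter.items()}
        for cell, counter in counts.items()
    }


def construct_glue(d2_records, cond_b, size, rng):
    """Step (2): resample (a, b_prime) from D_2 with replacement, then impute B."""
    glue = []
    for a, b_prime in rng.choices(d2_records, k=size):
        probs = cond_b[(a, b_prime)]  # raises KeyError for cells unseen in the glue
        values = list(probs)
        weights = [probs[v] for v in values]
        b = rng.choices(values, weights=weights, k=1)[0]
        glue.append((a, b, b_prime))
    return glue


# Toy data (hypothetical): A in {"young", "old"}, B in {1, 2}, B' in {1, 2, 3}.
collected_glue = [
    ("young", 1, 1), ("young", 2, 1), ("young", 2, 2), ("young", 2, 3),
    ("old", 1, 1), ("old", 1, 2), ("old", 2, 2), ("old", 1, 3),
]
d2 = [("young", 1), ("young", 2), ("old", 2), ("old", 3), ("young", 3)]

cond_b = fit_conditional_b(collected_glue)
constructed = construct_glue(d2, cond_b, size=len(collected_glue), rng=random.Random(0))
```

Because the (A, B') pairs are drawn from D_2, the constructed records inherit the marginal distribution of A and B' from the survey, while B is filled in from the conditional distribution estimated on the glue.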

To assess the validity of the assumptions that P(B | A, B') and P(B' | A, B) from D_g are
representative of the population of interest, analysts can compare the empirical distributions of the
sampled B and B' variables in step (2) to those from D_1 and D_2. When these empirical
distributions differ greatly, the assumptions of conditional representativeness of the glue may be
inappropriate, and the glue is not useful for data fusion. When only one conditional distribution, either
P(B | A, B') or P(B' | A, B), seems reasonable, the glue can be constructed using that conditional
distribution only. Analysts can choose the number of records in the constructed D* to reflect their
level of certainty about the conditional distributions. As a default, we recommend using the same
sample size as the collected D_g.

We now illustrate that this diagnostic procedure can detect whether or not glue is
representative on P(B | A, B') or P(B' | A, B). We consider a setting in which D_g is representative on
P(B | A, B') but not on P(B' | A, B), constructed as follows. For {A_g, A_a}, we over-sample
women and older individuals by keeping all observations with A_g = 2 or A_a > 4, and sample each
of the remaining observations with probability 0.5. This results in n_g = 2,837 auxiliary cases. We
sample each record’s B' from {1, 2, 3} with probabilities (0.7, 0.15, 0.15). This is highly
nonrepresentative, as the true marginal probabilities are (0.41, 0.32, 0.27). We sample each record’s B from
{1, 2} with probabilities given by the empirical P(B | A, B') from the original data. Thus, D_g is
representative in terms of P(B | A, B'), but not on P(B' | A, B) or any marginal distributions.
We fit the DPMPM model to D_g to estimate P(B | A, B') and P(B' | A, B), as described in step
(1), and construct D* as described in step (2). The resulting marginal distribution for the imputed
B is extremely close to the empirical distribution of B from D_1, with differences of only 0.01. The
marginal distribution for imputed B' is (0.57, 0.23, 0.20), quite far from the original data values.
The diagnostic suggests that P(B' | A, B) is not representative, whereas it may be reasonable to
rely on P(B | A, B').
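
The comparison for B' in this example can be made concrete with a small calculation. The sketch below uses the two marginal distributions quoted above; the discrepancy measures and the cutoff `TOLERANCE` are illustrative assumptions, since the text describes the diagnostic as a visual or informal comparison and does not prescribe a numerical threshold.

```python
def max_abs_difference(p, q):
    """Largest absolute gap between two discrete distributions on the same categories."""
    return max(abs(a - b) for a, b in zip(p, q))


def total_variation(p, q):
    """Total variation distance: half the sum of absolute differences."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Marginal distributions of B' quoted in the text.
original_b_prime = (0.41, 0.32, 0.27)   # original data
imputed_b_prime = (0.57, 0.23, 0.20)    # imputed from the estimated P(B' | A, B)

TOLERANCE = 0.05  # hypothetical cutoff, chosen only for illustration


def looks_representative(p, q, tolerance=TOLERANCE):
    """Flag a conditional as plausible when the implied marginals stay within tolerance."""
    return max_abs_difference(p, q) <= tolerance


gap = max_abs_difference(imputed_b_prime, original_b_prime)
tv = total_variation(imputed_b_prime, original_b_prime)
print(f"max gap = {gap:.2f}, total variation = {tv:.2f}")
```

Both measures come out at about 0.16 for B', far above the differences of roughly 0.01 reported for B, which matches the conclusion that only P(B | A, B') should be relied upon.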

4 HarperCollins data fusion with CivicScience glue 

We now turn to the HarperCollins data fusion. We seek to combine information from two surveys.
In D_1, HarperCollins asked n_1 = 2,000 respondents questions related to the discovery of new
authors, e.g., “Do you become aware of an author by [medium]?” for different mediums.^ In D_2,
HarperCollins asked n_2 = 5,015 different people about their interest in various authors. Each
person was asked about different subsets of authors, so D_2 includes many missing values. We let
B represent author discovery via the mediums Best Seller List, Facebook, library, online,
recommendations, and bookstore. We let B' represent interest in the authors Shel Silverstein, Agatha
Christie, Suzanne Collins, Stephenie Meyer, and Lisa Kleypas. Each B_j is recorded as yes or no.
Each B'_j is recorded as one of three categories, namely read, interested, or not interested. Both
D_1 and D_2 contain the demographic variables age, gender, and income, all of which are of strong
interest to HarperCollins for market segmentation. Our goal is inference on relationships between
discovery medium and author interest, in particular on the distributions P(B | B'), P(B, B'), and
P(B, B' | A).

^Although the survey contained 4,001 respondents, only half were asked about author discovery.

We provided CivicScience with a list of questions to ask in one of their surveys, with the goal of
procuring glue. CivicScience collected n_g = 2,730 simultaneous observations on author discovery
and interest, along with age and gender for many (but not all) respondents. There are some key
differences between the data collected by CivicScience and those in the original HarperCollins
surveys. In particular, the CivicScience respondents tend to be older; over 60% are 55+ years
old compared to only 30% of HarperCollins respondents (see Figure 2). We conjecture that this is a
consequence of the voluntary nature of the internet data collection done by CivicScience. We note
that the distributions of A variables in D_1 and D_2 are very similar.

As discussed in Section 3.3.3, it is not prudent to proceed with data fusion by appending the
nonrepresentative sample from the CivicScience survey to (D_1, D_2). We therefore construct D* that
reflects the marginal distribution of {A, B'} in D_2 and the conditional distribution P(B | A, B')
estimated from the collected CivicScience data, following the procedure for non-representative
glue described in Section 3.3.3. We first duplicate {A, B'} from D_2, and then sample values of
{B | A, B'} for these duplicated records using a DPMPM applied to the CivicScience data. As
evident in Figure 3, the empirical probability distributions for the observed values of B in D_1 and
the sampled values of B from P(B | A, B') are similar, suggesting that it is not unreasonable to
use the CivicScience data to estimate P(B | A, B'). We also considered creating D* by duplicating
{A, B} from D_1 and sampling {B' | A, B} for the duplicated records. However, as shown in Figure
3, the sampled marginal distributions for B' do not closely match the empirical distributions in D_2.
We therefore do not assume {B' | A, B} in the CivicScience data is representative, and construct
D* only from the duplicated {A, B'} sample from D_2.

After appending the constructed D* to (D_1, D_2), we estimate the DPMPM model on the
concatenated data. In the process we impute all missing values in D_1 and D_2. As in the simulation
studies, we keep m = 50 of these completed datasets, spacing them far apart in the MCMC
iterations to ensure approximate independence. We use the completed versions of D_1 and D_2 for
multiple imputation inferences.
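
For readers unfamiliar with multiple imputation inference, the sketch below shows how estimates from m completed datasets are typically combined. It assumes the standard combining rules of Rubin (1987) are what is meant by multiple imputation inferences here; the function name and the toy inputs are hypothetical, and interval construction is not shown.

```python
from statistics import mean, variance


def combine_imputations(point_estimates, within_variances):
    """Combine per-dataset estimates with Rubin's rules.

    point_estimates[l]  : estimate of the quantity from completed dataset l
    within_variances[l] : its estimated sampling variance in dataset l
    Returns the pooled point estimate and its total variance.
    """
    m = len(point_estimates)
    q_bar = mean(point_estimates)           # pooled point estimate
    u_bar = mean(within_variances)          # average within-imputation variance
    b = variance(point_estimates)           # between-imputation variance (divisor m - 1)
    total_variance = u_bar + (1 + 1 / m) * b
    return q_bar, total_variance


# Toy inputs (hypothetical) with m = 3 completed datasets.
estimate, total = combine_imputations([0.2, 0.3, 0.4], [0.01, 0.01, 0.01])
```

The between-imputation term is what carries the extra uncertainty due to the missing values; with m = 50 the factor (1 + 1/m) is close to one.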

Figure 2: Age distributions from the HarperCollins (dark gray) and CivicScience (light gray)
surveys.

Figure 3: Left: Empirical probabilities assigned to category 1 (‘o’ symbol) and category 2 (‘x’
symbol) for each of 6 discovery questions by sampling B as implied by inference for P(B | A, B')
from the CivicScience data versus marginal distributions of B from the survey data. Right:
Empirical probabilities assigned to category 1 (‘o’ symbol), category 2 (‘x’ symbol), and category
3 (‘o’ symbol) for each of 6 author interest questions by sampling B' as implied by inference for
P(B' | A, B) from the CivicScience data versus B' from the survey data.

As a first data fusion inference relevant for marketing strategies, we estimate probabilities of
discovery via a given medium for those who have read or are interested in reading a particular
author. As evident in Figure 4, high income individuals appear very likely to discover books via
recommendations regardless of author. Low income individuals are also likely to discover books
through recommendations, but the extent to which this is the case is more variable by author; for
instance, low income individuals who have read Christie are more likely to discover new books
via recommendations than those who have read Collins. Among individuals who have read Meyer,
those with high incomes are very likely to discover books at the library, whereas those with low
incomes are not. Low income individuals appear more likely to discover books via the Internet than
high income individuals for readers of all authors except Kleypas. In fact, low and high income
individuals who have read Kleypas do not appear to differ in terms of discovery.

We also look at author discovery conditional on reading interest and age, as opposed to income.
Figure 5 displays inference for P(B = yes | B' = read, age) across age groups for three different
combinations of discovery mediums B and authors B'. There appears to be an increasing trend in
discovery via Best Seller List for those who have read Meyer. In other words, older individuals
who have read Meyer are more likely to discover new books through the Best Seller List than
younger individuals. Quadratic trends are present for discovery via the Internet for those who have
read Silverstein and in discovery via Bookstores for those who have read Collins. As evidence of
the impact of glue, Figure 5 also displays the multiple imputation point estimates obtained from
the DPMPM model fit without using the CivicScience data. In some cases these estimates agree
in terms of the trends they suggest (e.g., the middle figure) but sometimes there are fairly stark
differences, such as in the leftmost figure.

Finally, we estimate the conditional distributions P(B | B') for particular discovery mediums
and authors. Figure 6 displays these probability distributions for authors Silverstein and Christie,
under models applied with and without glue. It appears that fans of Silverstein’s books use
Facebook to find out about new books more frequently than fans of Christie’s books; however, both
readerships rely on the Best Seller List equally. We note that the glue impacts inference for even
these marginal probabilities.

[Figure 4 panels: low income; high income.]

Figure 4: Multiple imputation point estimates for P(B = yes | B' = read, income) for low and
high income groups and all mediums B and authors B'. Black indicates larger probabilities, and
white indicates smaller probabilities.

Figure 5: Multiple imputation point estimates and 95% confidence intervals for P(B = yes | B' =
read, age) across age groups for three different combinations of mediums B and authors B'. Open
circles refer to the estimates under the DPMPM model applied without any glue. Left: Probability
of discovery via Best Seller List given one has read Meyer. Middle: Probability of discovery
Online given one has read Silverstein. Right: Probability of discovery via Bookstores given one
has read Collins.

Figure 6: Posterior mean estimates for P(B = yes | B' = read) for B representing each of 5
mediums and B' representing Silverstein (dark gray) and Christie (light gray) under the model
applied with glue (left) and without glue (right).


5 Concluding remarks 


While useful for marketing purposes in their own right, the results of the HarperCollins and
CivicScience data fusion offer some general lessons about integrating online and traditional survey data.
First, it is possible to improve inferences by collecting glue, even when the additional data include
only portions of the full joint distribution of interest. However, crucially, the glue and survey data
should represent the same distribution. Second, data from online polling companies like
CivicScience, not surprisingly, are likely to be unrepresentative on some dimensions. However, when
one believes that conditional distributions in the polling data are reliable, one can construct
appropriate glue from the conditional distributions, as we did in the HarperCollins data fusion. Third,
it is important to understand the limitations of the online data. For example, the CivicScience
data include very few young people. Thus, the estimate of P(B | A, B') from the CivicScience
data when A refers to a young person has high variance, so that the glue may not offer adequate
information about young people.

The simulations with the HarperCollins data also point to interesting directions for future
research. In those simulations, adding gender to glue already containing age does not noticeably
improve the inferences. In practice, one would expect the cost of collecting glue to increase with
the number of variables; hence, in this simulated fusion context, it may not be cost effective to
collect gender as part of the glue. This suggests a benefit for research on methods for selecting the
variables that most improve the accuracy of data fusion, taking into account the cost of obtaining
those variables.

A Posterior computation


In order to obtain inference under the hierarchical model, we use a Gibbs sampler to simulate
from the posterior distribution p(Y^mis, Z, φ, V, α | data), where Y^mis refers to all missing
values in Y_i = (A_i, B_i, B'_i) from D_1 and D_2, and data refers to all observations of (A_i, B_i, B'_i)
in D_1, D_2, and D_g. For computational expediency, we need not impute missing values for D_g,
as we are simply using this data to inform nonidentifiable relationships. However, it would be
straightforward to impute these missing values just like we impute missing values in D_1 and D_2.
We now describe the posterior full conditionals for all model parameters.


Full conditional for Z 


The mixture allocation variables Z_i, for i = 1, ..., n, are updated from categorical distributions
with probabilities given by

$$
p(Z_i = h \mid Y_i, \pi, \phi) = \frac{\pi_h \prod_{j=1}^{p} \phi^{(j)}_{h Y_{ij}}}{\sum_{k=1}^{N} \pi_k \prod_{j=1}^{p} \phi^{(j)}_{k Y_{ij}}} \tag{5}
$$

for h = 1, ..., N. For the glue cases, let J_i represent the variables in {1, ..., p} that are observed
for glue case i. The variable Z_i, i = 1, ..., n_g, is updated from a categorical distribution with

$$
p(Z_i = h \mid Y_i, \pi, \phi) = \frac{\pi_h \prod_{j \in J_i} \phi^{(j)}_{h Y_{ij}}}{\sum_{k=1}^{N} \pi_k \prod_{j \in J_i} \phi^{(j)}_{k Y_{ij}}} \tag{6}
$$

for h = 1, ..., N.
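
The allocation update can be sketched as follows. The only difference between a survey case and a glue case is which variables enter the product, so the sketch takes the observed variables as a dictionary. The data layout (`phi[h][j][c]`, zero-based category codes) and the toy parameter values are assumptions made for illustration.

```python
def allocation_probabilities(y_observed, pi, phi):
    """Probabilities that a case belongs to each mixture component.

    y_observed : dict mapping variable index j to the observed category (0-based);
                 for a glue case it contains only the variables in J_i
    pi         : list of N mixture weights
    phi        : phi[h][j][c] = probability of category c for variable j in component h
    """
    weights = []
    for h, pi_h in enumerate(pi):
        w = pi_h
        for j, category in y_observed.items():
            w *= phi[h][j][category]
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


# Toy parameters (hypothetical): N = 2 components, p = 2 binary variables.
pi = [0.6, 0.4]
phi = [
    [[0.9, 0.1], [0.2, 0.8]],   # component 0
    [[0.3, 0.7], [0.5, 0.5]],   # component 1
]

survey_case = allocation_probabilities({0: 0, 1: 1}, pi, phi)  # both variables used
glue_case = allocation_probabilities({0: 0}, pi, phi)          # only variable 0 observed
```

With many variables the products can underflow, so a practical implementation would usually accumulate log-probabilities and normalize on the log scale.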


Full conditional for φ

To update φ_h^(j) for h = 1, ..., N, and j = 1, ..., p, sample from a Dirichlet distribution:

$$
p(\phi_h^{(j)} \mid Y, Z) \propto \mathrm{Dirichlet}\left( a_{j1} + \sum_{\{i: Z_i = h\}} \mathbf{1}(Y_{ij} = 1), \; \ldots, \; a_{jd_j} + \sum_{\{i: Z_i = h\}} \mathbf{1}(Y_{ij} = d_j) \right), \tag{7}
$$

where the summations are over all survey and glue cases, i ∈ {1, ..., n + n_g}.

Full conditional for V 

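The stick-breaking proportions V_h, for h = 1, ..., N − 1, can be sampled from Beta distributions:

$$
p(V_h \mid \alpha, Z) \propto \mathrm{Beta}\left( M_h + 1, \; \alpha + \sum_{j=h+1}^{N} M_j \right), \tag{8}
$$

where

$$
M_h = \sum_{i=1}^{n+n_g} \mathbf{1}(Z_i = h).
$$

Fixing V_N = 1, the probabilities π are given by π_1 = V_1 and

$$
\pi_h = V_h \prod_{k=1}^{h-1} (1 - V_k) \quad \text{for } h = 2, \ldots, N.
$$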
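
A minimal sketch of this step, assuming the component counts M_h have already been tallied from the current allocations. The function names and the example counts are hypothetical.

```python
import random


def sample_stick_proportions(counts, alpha, rng):
    """Draw V_1, ..., V_{N-1} from their Beta full conditionals and fix V_N = 1.

    counts[h] is M_h, the number of cases currently allocated to component h.
    """
    n_components = len(counts)
    v = []
    for h in range(n_components - 1):
        tail = sum(counts[h + 1:])                 # sum of M_j for j > h
        v.append(rng.betavariate(counts[h] + 1, alpha + tail))
    v.append(1.0)
    return v


def stick_breaking_weights(v):
    """Convert stick-breaking proportions into mixture weights pi."""
    weights = []
    remaining = 1.0                                # stick length not yet assigned
    for v_h in v:
        weights.append(v_h * remaining)
        remaining *= 1.0 - v_h
    return weights


rng = random.Random(1)
v = sample_stick_proportions(counts=[40, 25, 10, 0], alpha=1.0, rng=rng)
pi = stick_breaking_weights(v)
```

Fixing the last proportion at one guarantees that the N weights sum to one, which is what makes the truncated representation a proper mixture.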

Full conditional for α

The DP precision parameter α can be sampled from a Gamma distribution:

$$
p(\alpha \mid V) \propto \mathrm{Gamma}\left( a_\alpha + N - 1, \; b_\alpha - \log \pi_N \right). \tag{9}
$$
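
The update for α can be sketched as below, assuming the Gamma distribution is parameterized by shape and rate and using the identity log π_N = Σ_{h<N} log(1 − V_h), which follows from V_N = 1. Python's `random.gammavariate` takes a shape and a scale, so the rate is inverted before sampling. The function names and hyperparameter values are hypothetical.

```python
import math
import random


def alpha_full_conditional(v, a_alpha, b_alpha):
    """Return (shape, rate) of the Gamma full conditional for alpha.

    v holds the stick-breaking proportions V_1, ..., V_N with V_N = 1.
    """
    n_components = len(v)
    # log(pi_N) equals the sum of log(1 - V_h) over h < N because V_N = 1.
    log_pi_n = sum(math.log(1.0 - v_h) for v_h in v[:-1])
    return a_alpha + n_components - 1, b_alpha - log_pi_n


def sample_alpha(v, a_alpha, b_alpha, rng):
    """Draw alpha; gammavariate expects (shape, scale), so scale = 1 / rate."""
    shape, rate = alpha_full_conditional(v, a_alpha, b_alpha)
    return rng.gammavariate(shape, 1.0 / rate)


shape, rate = alpha_full_conditional([0.5, 0.5, 1.0], a_alpha=0.25, b_alpha=0.25)
alpha_draw = sample_alpha([0.5, 0.5, 1.0], 0.25, 0.25, random.Random(2))
```

Because log π_N is negative, the rate is always larger than b_α, so small values of π_N (most of the stick already used) pull α toward smaller values.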

Imputing 

Missing Y_ij in D_1 and D_2 can be imputed by sampling from categorical distributions with the form
given in equation Q.
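
A minimal sketch of the imputation step, assuming each missing entry is drawn from the category probabilities of the mixture component to which the record is currently allocated. The data layout (`None` for a missing value, `phi[h][j][c]`, zero-based category codes) and the toy values are hypothetical.

```python
import random


def impute_missing(record, z, phi, rng):
    """Return a completed copy of one record.

    record : list of category codes, with None marking a missing value
    z      : index of the mixture component currently assigned to the record
    phi    : phi[h][j][c] = probability of category c for variable j in component h
    """
    completed = list(record)
    for j, value in enumerate(record):
        if value is None:
            probs = phi[z][j]
            # Draw a category index with the component-specific probabilities.
            completed[j] = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return completed


# Toy parameters (hypothetical): one component, a binary and a three-category variable.
phi = [[[0.3, 0.7], [0.2, 0.5, 0.3]]]
completed = impute_missing([1, None], z=0, phi=phi, rng=random.Random(3))
```

Observed entries are left untouched, so repeating this step across Gibbs iterations changes only the imputed values.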
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